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Abstract 

Typically, users of teleconferencing desktop applications must use headphones 
to eliminate the direct-path echo introduced by having a loudspeaker and a micro- 
phone in the same room. Unfortunately, many users prefer not to use headphones. 
Implementing echo cancelation would be one way for desktop audio conferencing 
systems to overcome this limitation. 

One technique used to remove the echo is to introduce an autoregressive filter 
to identify the echo path impulse response defined by the system {Speaker + Room 
+ Microphone}. The signal output to the speaker is then filtered by the estimated 
echo path impulse response, and the result is subtracted from the degraded input 
speech to obtain an estimate of the input speech without echoes. 

In this paper, we describe the identification process which leads to the least mean 
square (LMS) algorithms and the recursive least square (RLS) algorithms. We also 
present an "Echo-Cancelation Software Lab" which was implemented and optimized 
to allow real-time testing of the LMS, normalized LMS, homogeneous adaptation, 
and individual adaptation algorithms. 

Keywords: audio, echo cancelation, least mean square, adaptive filtering, gradient 
algorithm, conferencing 
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1 Introduction 

A typical desktop audio conferencing system uses a speaker and a microphone at 
each end of the transmission link. An audible problem occurs, however, when the 
output of the speaker is simultaneously recorded by the microphone and then trans- 
mitted digitally, hence creating a direct-path echo which interferes with spoken words 
Also, reverberations of the output speaker signal on walls, tables and human bodies 
located in the conference room add to this direct echo and constitute indirect echoes. 

To overcome this problem, a replica of the echoes can be synthesized by modeling 
the echo path and then subtracting the result from the outgoing signal. Adaptive 
filtering methods can be used to implement this model, and the system { Speaker 
+ Room + Microphone } can be modelled with an autoregressive filter with time- 
varying coefficients. The least mean square (LMS) and recursive least square (RLS) 
algorithm classes are most commonly used to update the filter coefficients. 

In this paper, Section 2 relates the identification process which leads to the LMS 
and the RLS algorithms. The LMS algorithms are then described in Section 3 in 
more detail. Section 4 presents an "Echo-Cancelation Software Lab" which was 
implemented to allow real-time testing of LMS algorithms such as the LMS, the 
normalized LMS, the homogeneous adaptation, and the individual adaptation. 



2 Identification, Model and Criterion 
2.1 Identification 

In parametric spectrum analysis, the power spectrum of a process is estimated by 
assuming a model H (z) for this process 1 . In the case of the echo path impulse re- 
sponse, the unknown system {Speaker + Room + Microphone} is identified as the 
output of a linear adaptive filter that is excited by a white-noise process 2 . 

Figure 1 defines the identification procedure used to evaluate the model. The un- 
known system and the model are driven by the same input u(n) which is the in- 
coming speech signal from the far-end of a conference. A criterion on the difference 
e(n) of the output is optimized, and a feedback is given to the model for adaptation. 
e(n) also defines the outgoing signal and is played back at the far-end speaker. 



1 A good reference for adaptive filter theory is [1]. 

2 By definition, a white-noise process has a constant power spectrum. 



2. IDENTIFICATION, MODEL AND CRITERION 



u(n) 



Speaker + Room + Microphone 



s(n) 



Model: H(z) 



s'(n) 



e(n) 



Optimization 



Figure 1: Identification of unknown system {Speaker + Room + Microphone} 
2.2 Model 

A model that is of practical utility because of its simple implementation is the au- 
toregressive (AR) model. Indeed, the filter's transfer function is assumed to consist 
only of poles. Let this transfer function be denoted by 



H(z) 



G 



1 + E;=i 



where the a; are called the autoregressive parameters or filter coefficients, p the 
model order and G, the system's scale factor. 

Another way to understand the AR model is to see the output s(n) of the unknown 
system as a linear prediction of the delayed output values and an unknown input 
signal u(n) which is defined as white noise: 



p 

s(n) = ^ dis(n — i) + Gu(n) 

i=l 

Therefore, we will approximate the output signal recorded by the microphone as a 
linear prediction of the past output signal: 



p 

s'(n) = ^ dis(n — i) 

i=l 



When using this model, we have the following error: 



2.3 Criterion 
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p 

e(n) = s(n) — s'(n) = s(n) — ^ dis(n — i) 

i=l 

For the optimal case where the unknown system is autoregressive, the variance of 
the error is the square of the scale factor: 

E[e(n) 2 } = G 2 
where E[x 2 ] is the expected value of x 2 . 

2.3 Criterion 

Since the statistical properties of the room change slowly in time, we define the 
model to be time varying. 

To estimate the time dependent filter coefficients, we can choose among adaptive 
methods that are differentiated by their criterion. There are two criteria commonly 
used: 

• If the criterion is defined as minimizing the mean squared error at time n, 
E[e(n) 2 ], compared to the autoregressive parameters, then the algorithm is a 
least mean square based algorithm. 

• If the criterion is defined as minimizing a weighted sum of squared errors 
such as 

£>(*)<£(*) 

k=l 

where w(0),w(l), ... is a sequence of weights, then the algorithm is called a 
recursive least square (RLS) algorithm. 

The computational complexity of the LMS and RLS algorithms are O (p) and O (p 2 ) , 
(with p being the order of the model). The use of the LMS algorithms is widespread 
due to its computational simplicity. 

Note that fast RLS algorithms have been introduced that circumvent the computa- 
tional burden of the conventional RLS algorithms. There are two families of such 
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3. THE MEAN-SQUARED-ERROR CRITERION 



fast algorithms, corresponding to two possible filter structures: the fast lattice (FLA) 
and the fast transversal filter (FTF) algorithms. While the LMS has a computational 
complexity of 2p multiplications, the FLA and the FTF need 8p to 30p multiplica- 
tions. Although less burdensome computationally than RLS, the FTF algorithm can 
be subject to numerical instability problems. A solution to this problem is proposed 
in [2]. 

3 The Mean-Squared-Error Criterion 

3.1 The Gradient or Steepest-Descent Method 

The criterion of the LMS algorithms is defined as minimizing the Mean-Square- 
Error E[e(n) 2 ] compared to the filter coefficient vector: 

a p (n) = (ai(7i),a 2 (7i), • ■■,a p (n)) T 
The idea is to search for the point % min on the surface /(c^(tc)) defined by 

/(a») = E[(s(n) - ]T <n(n)s(n - i)) 2 ] 
»=i 

To do this, the steepest-descent method is used by successive approximations, con- 
verging to the vector Op min - The mean-squared-error surface can be viewed as a 
(p + 1) -dimensional paraboloid where p is the number of coefficients (or taps) of 
the impulse response. The optimum impulse response then corresponds to the min- 
imum of the paraboloid. 

The gradient method has three steps: 

• initialization: 

- chose initial vector (0) 

• propagation: 

- compute gradient at point Qp(n), defined as V/(oj,(ti)) 

- move on the surface by going in the opposite direction of the gradient, 
modifying the vector a^(n) as follows: 

a p (n + 1) = a p (n) - iiVf(a p (n)) 



3.2 The Gradient Algorithm 
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• termination: 

- stop when |V/(o^(ti))| < threshold 

where fi is the displacement step. As we will show, fi controls the speed and con- 
dition of convergence. 

3.2 The Gradient Algorithm 

The following notations are used to define the gradient. 

2p(, n ) = (. s (. n - 1 ), s (. n - 2 )r--,s(. n -p)) T 
R p = Ei^s^nf] 
c p = S[s(ra)3,(ra)] 

The gradient is defined by equation 1 and the propagation step of the gradient algo- 
rithm is given by equation 2. 

V/(a») = -2£[ ep (n)2p(n)] = 2R p a p (n) - 2c p (1) 
a p (n+l) = a p (n) + 2 f iE[e p (n)s p (n)] (2) 

Using equations 1 and 2 and assuming that the gradient is 0 at the minimum, the 
estimation error of vector (n) is defined by 

(^(n+lJ-a^J = [I-2^ p ](a p ( K )-a pm .J 

From this expression, we have information about the convergence of the gradient al- 
gorithm: the algorithm converges if at each step the norm of vector {^{n + 1) - a Pmin ) 
is smaller than the norm of vector (a p (n) — « Pmin )- This is the case if all the eigen- 
values Afc of the correlation matrix R p satisfy the following condition 3 : 

1 

0 < \i < — 



3 Pointers to initial proofs are given in [3]. 
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3. THE MEAN-SQUARED-ERROR CRITERION 



So, by restricting \i to 

1 

0 < \i < 



^max 



we are sure that the algorithm converges. 

Finding an estimate of A maa . is not easy. To bypass the estimation, we use the fact 
that 



max 

k=l 

and a sufficient condition (which is restrictive) is to take 

1 1 



0 < [L < 



Tr(R p ) pE[s{nf] 



Hence, fi can be controlled with an estimate of E[s(n) 2 ] which is the average signal 
power. 

3.3 The LMS Algorithms 

In practice, the gradient's value is not known, but we can estimate it using the input 
signal. The LMS algorithms estimate the gradient as the instantaneous expected 
value: 

Vf{a p {n)) = -2E[e p {n)s p {n)\ = -2e p {n)s p {n) 
With this estimation, the LMS algorithms comprise these steps: 

• Initialization: 

«*(0) = 0 

• Filter coefficient update: 



a k (n + 1) = a k (n) + 2ne p {n)s{n) 



3.3 The LMS Algorithms 
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• Output of system: 

p 

e p (n) = s(n) - ^ ak{n)s(n - k) 
k=l 

As shown earlier, the convergence factor \i can be controlled with an estimation of 
the average signal power. The following four algorithms differ only in the definition 
of this estimation: 

• The LMS defines the average signal power as a constant P s . Therefore, the 
convergence factor is 

> = W. <3) 

• The Normalized LMS algorithm (NLMS) defines the input power as 

P s {n + l) = (3P s {n) + {l-(3)s 2 {n) (4) 
with P usually in the range 4 of 0.99 to 0.9999. The convergence factor is then 



• The Homogeneous Adaptation algorithm (HA) [4] estimates the input power 
as 



P.(n)=-X> 2 (n-i) (6) 

P i=0 



The convergence factor is defined as 



M») = TT^T^ (V) 



4 Note if /3 is zero, no averaging takes place. As /3 approaches one, the effective length of the 
average becomes longer. For f3 equal to one, the NLMS reduces to the LMS algorithm. 
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3. THE MEAN-SQUARED-ERROR CRITERION 



• Finally, instead of denning \i as a scalar, the Individual Adaptation algorithm 
(IA) [4] defines it as a vector, hence controlling each coefficient separately. 



r \ _ H n -j)\ 

H\ U ) — „ v^p-l I / "\ 1 3 



3.4 Computational Complexity 

Assuming the filter coefficients are updated at the same time as the output of the 
filter, the number of multiplications and additions for the different LMS algorithms 
are shown in Table 1 . The last column gives the total number of operations. This ta- 
ble shows that each of the LMS, NLMS, HA and IA algorithms has a computational 
complexity which is linear in the number of filter taps. 



Algorithm 


* 


+ 


Total number of operations 


LMS 


2p+l 


2p 


4p+ 1 


NLMS 


2p + 5 


2p+l 


4p+ 6 


HA 


2p + 4 


2p+2 


Ap+ 6 


IA 


3p + 6 


2p+2 


5p+8 



Table 1 : Computational complexity (p is the number of filter taps) 

If no coefficient update is made, the number of operations at each step is algorithm 
independent and is p - 1 multiplications and p additions. 

Note that in the case of HA, we do not compute the sum of p squared terms each 
time, instead we update the sum as follows: 

p-i 

S(n) = s 2 (n -i) = s 2 (n) + S(n -l)-s 2 (n- p) 

i=0 

A similar decomposition was used for IA. 
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4 Experiments 

4.1 Evaluation: ERLE Measurements 

In order to evaluate the performance of an echo canceling system, the ratio of the 
expected value of the microphone output squared E[s 2 (n)] divided by the expected 
value of the error signal squared E[e 2 (n)} is monitored. This quantity, in dB, is 
called the Echo Return Loss Enhancement, or ERLE: 

ERLE=lZ\o g E[s2{n)] 



E[e 2 (n)] 

The expected value is estimated as follows: 



1 N 

k=i 



x 2 



4.2 The "Echo-Cancelation Software Lab" 

The "Echo-Cancelation Software Lab" is used to perform real-time testing of adap- 
tive filtering algorithms. For this report, we have implemented the following algo- 
rithms: LMS, normalized LMS, homogeneous adaptation, and individual adapta- 
tion. 

This testbed is written in the 'C programming language, with a graphical user in- 
terface written in Tcl/Tk [5], and uses the AF System [6] for audio support. Figure 2 
shows this graphical user interface 

As shown in Figure 3, this lab establishes a full-duplex audio connection between 
two offices. The echo cancelation algorithms were implemented and tested on one 
side of the conference where a speaker and a microphone were used. In the other 
office, the user could chose between a headset or a speaker/microphone set. 

The user can control the following algorithm parameters: 

• The length p of the filter 

• The error threshold setting the termination step of the gradient algorithm 5 



3 To save computation, the termination step is not controlled by the norm of the gradient, but by 
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• The stepsize c coefficient controlling the convergence factor 6 . Indeed, to al- 
low a linear control, the convergence factor in the testbed was defined as 
with c varying from 0 (no filter coefficient update) to 100. 

• The gap in sample periods between record and play time-stamps, simulating 
a transmission time delay between two offices 

• The filter delay introduced by the Analog to Digital and Digital to Analog 
Converter 



This test-bed also supports center-clipping echo suppression which, although not 
described in this paper, can be used to suppress the small amplitude tails of rela- 
tively long-delay echos. 

the norm of the error which is the output of the system. Therefore, the gradient need not be computed 
at each output update if the error is small enough. 

6 This coefficient c was not introduced before so as not to overload the notations. 



4.3 Results from the "Echo-Cancelation Software Lab" 
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Playbufl 



PlaybufO 



Figure 3: Testbed system diagram 



4.3 Results from the "Echo-Cancelation Software Lab' 



This experiment was done between two neighboring offices (call them A and B). 
In both offices, we used MD518 SENNHEISER microphones with a SHURE FPU 
microphone pre-amplifier set to the same gain (+54 dB). In office A, a J- Video AF 
audio server and Realistic Minimus -7 speakers were used, and the microphone was 
oriented towards the user (i.e. not toward the speaker). In office B, a J300 Sound 
and Motion board [7] was used with Labtec CS-180 speakers. The microphone in 
office B was oriented towards the speaker 12 inches away. 



4.3.1 Filter Calibration 

Starting with arbitrary initial filter coefficients, the algorithm will converge faster if 
a white noise is played at the input of the adaptive system 7 . During this calibration 
time, the filter adapts to the current echo path impulse response. 

The ERLE measurements for each algorithm were collected during the calibration 
and are shown in Figure 4. For this experiment, the same parameters were used for 
all algorithms. As we can see, the algorithms HA and IA converge faster than the 
NLMS, which converges faster than the LMS. This graph shows the relative per- 
formance between the different algorithms for a given set of parameters. Note that 



7 [8] gives a good explanation for why white noise is better during calibration. 
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this graph does not show the best performance of each algorithm. By changing the 
parameters, one can find the best set for each algorithm and for the studied acoustic 
environment. 



Calibration with White Noise 




seconds 

Figure 4: Calibration of the algorithms LMS, NLMS, HA and IA. Parameters used 
were as follows: filter length of 200, stepsize of 100, an error threshold of 1000, a 
gap of 1000 samples, and a delay of 280 samples. 



4.3.2 The HA Algorithm 

We now use the homogeneous adaptation algorithm as an example to demonstrate 
the effects of varying the filter length, step size, and delay. Comparing the computa- 
tional complexity of the HA algorithm given in Table 1 and the convergence speed 
suggested by Figure 4 with the other algorithms, we believe that this algorithm is 
best from a complexity /performance perspective. 

• Varying the filter length 

Figure 5 shows two interesting properties as the number of the adaptive filter 
taps increases: 

- The Echo Return Loss Enhancement also increases. With 50 taps, the 
ERLE value is 12 dB at steady state, whereas with 250 taps it is 18 dB. 



Results from the "Echo-Cancelation Software Lab" 
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- The convergence time also increases. With 50 taps, the steady state is 
reached after 300 milliseconds and after 2 seconds for 250 taps. 

Note that the ERLE value does not start at zero. This is because the HA al- 
gorithm started to converge, with the ambient noise from office B, before ini- 
tiating the noise burst. 

Varying the stepsize 

Figure 6 shows that as the stepsize increases, the convergence time decreases. 
(Recall that for the HA algorithm to be stable, the stepsize c/100 may not be 
greater than 1.0). We note also that the smaller the stepsize is, the closer the 
algorithm converges to the optimal point (increasing the ERLE value). 

Varying the delay 

Figure 7 represents the filter coefficients for the HA algorithm with two dif- 
ferent delays (250 and 280 samples). The offset in the filter coefficients is 
due to the delay introduced by some components in the system (the Analog 
to Digital Converter and Digital to Analog Converter filter implementations). 
If this delay is not estimated correctly, most of the leading filter coefficients 
will be zeroes and CPU power will be wasted. 
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HA with stepsize 1.0, .51 and .23 




seconds 



Figure 6: HA algorithm with equal to 1.0, .51 and .23 



4.4 Performance Analysis 



One major goal for an echo canceler algorithm is to provide real-time operation. 
Therefore, we implemented the four LMS algorithms, then analyzed the perfor- 
mance and measured the number of cycles per filter tap for each algorithm. 



4.4.1 Implementation and Analysis 

The four algorithms were implemented in C and compiled on OSF/1 for Alpha V3.0 
using the native C compiler. The compilation arguments used were 

-02 -non_shared -float -f loat_const. 

The following tools were used to analyze and profile the programs: 

• The omdiag utility from the ATOM [9] kit was used to diagram how instruc- 
tions issued. 

• The Alpha process cycle counter was used to measure the number of cycles 
spent updating the output and the filter coefficients. The process cycle counter 
provides CPU clock cycle resolution, and can be used to gather real-time or 
process virtual time data. 



4.4 Performance Analysis 
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HA with delay 280 and 250 




10 20 30 40 50 60 70 80 90 100 




0 10 20 30 40 50 60 70 
filter coefficients 



90 100 



Figure 7: HA algorithm with delay 280 (top) and 250 (bottom) 

• The prof -pixie command gave us profiling information. 

We found, as expected, that most of the CPU cycles were used to execute two for 
loops, one computing the output of the system, the other updating the filter coef- 
ficients. These two loops are shown below. The C comments give the number of 
cycles executed in each line over the total number of cycles executed by the pro- 
gram. 



/* update output of system */ 

for (hi=0 , n=start ; hi<f ilterlen; hi++, n++) { 
if (n==MAXLEN) n=0; /* 
y += h [hi] * x [n] ; /* 

1 



2 . 67% 
44.10% 



*/ 
*/ 



/* update the filter coefficients */ 

for (hi=0 , n=start ; hi<f ilterlen; hi++, n++) { 

if (n==MAXLEN) n=0 ; /* 2.67% 

h [hi] += u * x [n] ; /* 49 . 43% 

} 



In the case of the HA algorithm, we observed that these two loops took 98.87% of 
the total number of cycles executed. 
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4.4.2 Bench Mark Results 

The performance test program counted the number of CPU cycles executed to com- 
pute the output of the system and update the filter coefficients. To do this, we com- 
puted an average number of cycles per tap for filters of length 17, 33, 49,..., 1025 
and the average of these results was defined as the output of the bench mark. 

We ran the test on three systems: 

• A DEC 3000 Model 400 workstation, 21064 processor running at 133 MHz 

• A DEC 3000 Model 700 workstation, 2 1 064A processor running at 225 MHz 

• A DEC 3000 Model 900 workstation, 2 1 064A processor running at 275 MHz 





LMS 


NLMS 


HA 


IA 


DEC 3000 Model 400 


19 


20 


20 


49 


DEC 3000 Model 700 


19 


19 


19 


48 


DEC 3000 Model 900 


19 


19 


19 


48 



Table 2: Cycles per tap 

Table 2 gives the results of the bench mark in number of cycles per tap while Table 3 
gives the computation time in microseconds for a 500 tap filter 8 . 





LMS 


NLMS 


HA 


IA 


DEC 3000 Model 400 


72 


75 


75 


184 


DEC 3000 Model 700 


42 


42 


42 


107 


DEC 3000 Model 900 


35 


35 


35 


87 



Table 3: Computation time for a 500 tap filter (in microseconds) 

As depicted in Table 2, the I A algorithm costs 2.5 times more in executed cycles 
than the other algorithms. The extra computation cost is incurred because the com- 
piler did not unroll the for loop updating the filter coefficients. If the compiler 

8 Note: if x is the number of cycles per tap and S the CPU clock speed in Hz, a 500 tap filter 
takes 500a;5 _1 seconds to compute. For a sampling rate of 8000 Hz, the filter update and output 
computation must not exceed 125 microseconds. 
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unrolled the loop below, or if it were unrolled by hand, then the computation time 
for IA would be closer to HA. 

/* update the coefficients */ 

for (hi=0 , n=start ; hi<f ilterlen; hi++, n++) { 

if (n==MAXLEN) n=0 ; /* 7.68% */ 

tmp = x[n]; /* make it a local variable */ /* 11.52% */ 
if ( (bufxl = tmp) < 0) bufxl = -bufxl; /* 7.68% */ 

h[hi] += u * bufxl * tmp; /* 38.42% */ 

1 



5 Conclusion 

In this paper, we described the identification process which leads to the LMS algo- 
rithms and the recursive least square algorithms. We presented an optimized real- 
time Echo-Cancelation Software Lab which allows testing of the LMS, normalized 
least mean square, homogeneous adaptation, and individual adaptation algorithms. 
We also presented the results of a bench mark measuring the number of cycles per 
tap for each of these four algorithms on three different DEC 3000 Model worksta- 
tions. With this Software Lab, we found the homogeneous adaptation algorithm to 
produce the best results. 
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